Reinforcement Learning C3.3 Delayed reinforcement learning
Authors
Abstract
See the abstract for Chapter C3. Delayed reinforcement learning (RL) concerns the solution of stochastic optimal control problems. In this section we formulate and discuss the basics of such problems. Solution methods for delayed RL will be presented in Sections C3.4 and C3.5. In these three sections we mainly consider problems in which the state and control spaces are finite sets, because the main issues and solution methods of delayed RL are most easily explained for such problems. We deal briefly with continuous state and/or action spaces in Section C3.5.

Consider a discrete-time stochastic dynamic system with a finite set of states, X. Let the system begin its operation at t = 0. At time t the agent (controller) observes state† x_t and selects (and performs) action a_t from a finite set, A(x_t), of possible actions. Assume that the system is Markovian and stationary, that is,

$$\operatorname{Prob}\{x_{t+1}=y \mid x_0, a_0, x_1, a_1, \dots, x_t = x, a_t = a\} \;=\; \operatorname{Prob}\{x_{t+1}=y \mid x_t = x, a_t = a\} \;\stackrel{\text{def}}{=}\; P_{xy}(a)\,.$$

A policy is a method adopted by the agent to choose actions. The objective of the decision task is to find a policy that is optimal in a well-defined sense, described below. In general, the action specified by the agent's policy at some time can depend on the entire past history of the system. Here we restrict attention to policies that specify actions based only on the current state of the system. A deterministic policy π assigns to each x ∈ X an action π(x) ∈ A(x). A stochastic policy π defines, for each x ∈ X, a probability distribution over the set of feasible actions at x, that is, it gives the values of Prob{π(x) = a} for all a ∈ A(x). To keep the notation simple we consider only deterministic policies in this section; all of the ideas extend easily to stochastic policies with suitably detailed notation. Let us now precisely define the optimality criterion.
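Concretely, the stationary transition law P_xy(a) can be stored as a table and sampled to simulate the system under a policy. The sketch below uses a hypothetical two-state, two-action system; all names and numbers are illustrative, not from the chapter:

```python
import random

# Hypothetical two-state, two-action system (numbers are illustrative):
# P[(x, a)][y] = Prob{x_{t+1} = y | x_t = x, a_t = a}.
P = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {1: 1.0},
}

def step(x, a, rng):
    """Sample x_{t+1} from P_xy(a): by the Markov assumption, the
    distribution of the next state depends only on the current state x
    and action a, not on the earlier history."""
    u, acc = rng.random(), 0.0
    for y, p in P[(x, a)].items():
        acc += p
        if u < acc:
            return y
    return y  # guard against floating-point round-off

pi = {0: 1, 1: 0}  # a deterministic policy: pi[x] is an action in A(x)

rng = random.Random(0)
x = 0
trajectory = [x]
for t in range(5):
    x = step(x, pi[x], rng)
    trajectory.append(x)
```

Because the policy is a function of the current state only, the state sequence generated this way is itself a Markov chain with transition probabilities P_xy(π(x)).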
While in state x, if the agent performs action a, it receives an immediate payoff, or reward, r(x, a). Given a policy π we define the value function, V^π : X → R, as follows‡:

$$V^{\pi}(x) \;=\; E\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(x_t, \pi(x_t)) \,\Big|\, x_0 = x\right]$$

where γ ∈ [0, 1) is a discount factor.
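Under the standard infinite-horizon discounted criterion (with discount factor γ ∈ [0, 1)), V^π for a small finite system can be computed by repeatedly applying the fixed-point relation V^π(x) = r(x, π(x)) + γ Σ_y P_xy(π(x)) V^π(y) until the values stop changing. A minimal sketch, with hypothetical transition probabilities, rewards, and policy:

```python
# Iterative policy evaluation for a fixed deterministic policy pi.
# P, r, pi and gamma below are illustrative, not from the chapter.
P = {  # P[(x, a)][y] = Prob{x_{t+1} = y | x_t = x, a_t = a}
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {1: 1.0},
}
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.0}  # r(x, a)
pi = {0: 1, 1: 0}   # deterministic policy: pi[x] = action at state x
gamma = 0.9         # discount factor, gamma in [0, 1)

def evaluate(P, r, pi, gamma, tol=1e-10):
    """Return V^pi as a dict mapping each state to its value."""
    states = sorted({x for (x, _) in P})
    V = {x: 0.0 for x in states}
    while True:
        delta = 0.0
        for x in states:
            a = pi[x]
            # One application of the fixed-point relation at state x.
            v = r[(x, a)] + gamma * sum(p * V[y] for y, p in P[(x, a)].items())
            delta = max(delta, abs(v - V[x]))
            V[x] = v
        if delta < tol:
            return V

V = evaluate(P, r, pi, gamma)
```

For γ < 1 this iteration is a contraction, so it converges to the unique V^π regardless of the starting values; the in-place (Gauss-Seidel style) update used here only speeds that convergence up.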